Abstract

We have analyzed DNA sequences of known genes from 16 yeast chromosomes (Saccharomyces
cerevisiae) in terms of oligonucleotides. We have noticed that the relative abundances
of oligonucleotide usage in the genome follow a long-tail Levy-like distribution. We have
observed that long genes often use strongly over-represented and under-represented
oligonucleotides, whereas this was not the case for the short genes (shorter than 300
nucleotides) under consideration. If selection on the extremely over-represented/under-represented
oligonucleotides were strong, long genes would be more affected by spontaneous mutations than
short ones.
1 Introduction



Since 1995 more than thirty full genomes have been sequenced, and the sequences of hundreds
of millions of nucleotides have become available to the scientific community. This has opened
a new field of research, DNA statistical analysis, in which genomic sequences are analyzed at
the scale of single nucleotides, of oligonucleotides, and of thousands of nucleotides. At the
largest scale there are studies of the power spectral density and the correlation function,
in particular of the question whether statistical long-range base-base correlations exist.
Long-range correlation in DNA was first observed in 1992 by three groups: Li et al., Peng et al.,
and Voss. This has remained a very active topic. We will not give a long list of references
here but cite only some of the recent papers, by H.E. Stanley et al., Arneodo et al., and Vieira,
which address the problem directly. One can also visit the WWW home page by Li
(http://linkage.rockefeller.edu/wli/dna_corr) for references on this particular topic. We have
also contributed to the topic: in our earlier papers we showed that replication, which is an
asymmetric process, is responsible for introducing strong trends in the third bases of codons
and in consequence causes the long-range base-base correlations.

The examination of the long-range correlation in DNA is strongly connected with the statistical
methods applied to texts in natural languages, where one usually calculates the frequency
$f(k)$ of each word in a text ($k = 1, 2, \ldots, N$). If the words in the text are arranged in
rank order, from most frequent to least frequent, so that $f(1) \ge f(2) \ge f(3) \ge \ldots \ge f(N)$,
then one observes a power law (Zipf law), $f(k) \propto 1/k^{\zeta}$, with an exponent $\zeta$
and typically $\zeta \approx 1.0$ for natural languages. This analysis also applies well to
studying short-range time correlations in financial signals. In DNA the words are composed from
an "alphabet" of four letters A, T, G, C representing the nucleotides adenine, thymine, guanine
and cytosine. The n-tuplets are termed "n-words". The biological meaning of these n-words depends
on the value of n. Typically, in the case of coding DNA sequences the words are considered to be
3-tuplets, because three nucleotides (a codon) code for one amino acid. This triplet structure of
DNA coding sequences can be easily detected with the help of the power spectrum, because there is
a sharp peak at frequency $f = 1/3$ in the spectrum. The connection of the peak with the codon
structure was already reported by Voss in 1992 during the discussion of the long-range
correlations in DNA. This peak reflects the asymmetry of codons. For example, in the case of the
yeast genome (Saccharomyces cerevisiae) more than 75% of all genes have more A than T in the
first and second positions in codons, more G than C in the first positions of codons, and less G
than C in the second positions. Codons for hydrophobic amino acids are rich in T in the second
positions, whereas codons for hydrophilic amino acids are rich in A in the second positions. In
particular, the genes with a lower number of A than T in the second positions in codons represent
genes coding for transmembrane proteins. Thus, considering 3-tuplets in coding regions to be the
words in DNA texts is quite natural, in contrast to noncoding regions, for which the words are
not known. On the other hand, the observation of a much smaller peak around $f \approx 1/11$ in
the DNA power spectrum makes understanding the DNA words more complex. Namely, this peak might be
related to the DNA folding structure. A detailed discussion of the meaning of the peak can be
found in a paper by Trifonov, together with a discussion of other recognized periodicities in
genome sequences, the 200- and 400-base periodicities.
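The period-3 signal mentioned above can be illustrated with a minimal sketch. It assumes one common convention, in which the sequence is split into four 0/1 indicator sequences (one per base) and their squared Fourier amplitudes are summed; the function name `power_at` and the toy sequence are illustrative and are not taken from the cited works.

```python
import cmath

def power_at(seq, f):
    """Summed power of the four base-indicator sequences at frequency f
    (in cycles per base), normalized by the sequence length."""
    n = len(seq)
    total = 0.0
    for base in "ATGC":
        # Fourier amplitude of the 0/1 indicator sequence of this base
        amplitude = sum(cmath.exp(-2j * cmath.pi * f * k)
                        for k, c in enumerate(seq) if c == base)
        total += abs(amplitude) ** 2
    return total / n

# Toy "coding" sequence with a perfect codon periodicity (an assumption
# made only for illustration; real genes show a much weaker bias).
toy = "GCA" * 200
peak = power_at(toy, 1.0 / 3.0)   # strong signal at f = 1/3
off_peak = power_at(toy, 0.25)    # essentially no signal elsewhere
```

For a real gene the peak at f = 1/3 is far less extreme, since the base composition at each codon position is only biased, not fixed.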

The results of the Zipf analysis of 40 DNA sequences have been discussed in detail by Mantegna
et al. They found that the Zipf exponent $\zeta$ for a noncoding region is about 50% larger than
that for coding regions, and thus that noncoding sequences are closer to natural languages with
respect to their information content than the coding ones. Note, however, that noncoding regions
of DNA can also possess a strong signal at $f = 1/3$. The reason is that they can contain both
sequences which were coding in the past and sequences which may be recognized as genes in the
future.
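The Zipf analysis described above can be sketched in a few lines. This is only an illustration under stated assumptions: words are taken as non-overlapping n-tuplets, and the exponent is estimated by an ordinary least-squares fit of log f(k) against log k over all ranks. The cited studies may use other windowing and fitting choices, and the helper names are hypothetical.

```python
import math
from collections import Counter

def n_words(seq, n):
    """Split a sequence into non-overlapping n-tuplets (trailing bases dropped)."""
    return [seq[i:i + n] for i in range(0, len(seq) - n + 1, n)]

def zipf_ranking(words):
    """Relative word frequencies sorted from most to least frequent."""
    counts = Counter(words)
    total = sum(counts.values())
    return sorted((c / total for c in counts.values()), reverse=True)

def zipf_exponent(freqs):
    """Estimate zeta in f(k) ~ 1/k**zeta from a log-log least-squares fit."""
    xs = [math.log(k) for k in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return -slope

# Synthetic "text" whose word counts are exactly proportional to 1/k
text = ["w1"] * 1200 + ["w2"] * 600 + ["w3"] * 400 + ["w4"] * 300
zeta = zipf_exponent(zipf_ranking(text))
```

For this synthetic text the fitted exponent is 1 up to rounding, which is the value quoted above as typical for natural languages.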

The studies of oligonucleotides (n-words) in recent years indicate that they can play the role of
a genomic signature. Karlin and Burge showed that the relative abundance of dinucleotides
(2-words) can discriminate DNA sequences of different organisms. The abundances, particularly for
CG and TA, can reflect the species-specific replication and repair mechanisms (see also Karlin,
Mrazek and Campbell). They analyzed different dinucleotides with the help of effective
frequencies:



$$z = \frac{P_{ij}}{P_i P_j}, \qquad (1)$$



where $P_i$ denotes the frequency of nucleotide $i$ ($i = A, T, G, C$) and $P_{ij}$ denotes the
frequency of the dinucleotide $ij$ under consideration. In particular, they suggested that CG
under-representation should be advantageous for organisms which have small genomes and need to
replicate rapidly. On the other hand, TA under-representation renders DNA more flexible for
unwinding. The concept of genomic signature has recently been extended to n-words, where the
chaos game representation of DNA sequences in the form of fractal images has been used, following
the method developed by Jeffrey. We mention this work because we independently introduced a
similar concept of DNA representation in an earlier paper. In the method, the frequencies of
n-words are represented by a complex landscape of "hills" and "valleys" located on a square
board. An example of such a chaos game representation of a DNA sequence is presented in Fig. 1
for 6-tuplets constructed from nucleotides located in the first, second and third base positions
in codons of the yeast genes. One can observe asymmetric usage of the 6-words. A similar result
can be obtained for other word lengths n.
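The "square board" of the chaos game representation can be sketched as follows. The corner assignment (A, C, G, T placed at the four corners of the unit square) is an assumption made for illustration, since the layout used for Fig. 1 is not stated here, and the function names are hypothetical. Each n-word is mapped to one cell of a 2^n by 2^n grid, and the word count in that cell gives the height of the landscape.

```python
from collections import Counter

# Assumed corner layout of the unit square (x, y); other layouts are possible.
CORNERS = {"A": (0, 0), "C": (0, 1), "G": (1, 1), "T": (1, 0)}

def cgr_cell(word):
    """Grid cell (x, y) reached by the chaos game after reading `word`.
    The last base of the word selects the coarsest quadrant."""
    x = y = 0
    for j, base in enumerate(word):
        cx, cy = CORNERS[base]
        x |= cx << j
        y |= cy << j
    return x, y

def cgr_grid(words, n):
    """Count how often each n-word occurs and place the counts on the board."""
    size = 2 ** n
    grid = [[0] * size for _ in range(size)]
    for word, count in Counter(words).items():
        x, y = cgr_cell(word)
        grid[y][x] += count
    return grid

board = cgr_grid(["AT", "AT", "GG"], n=2)
```

For 6-words the board has 64 by 64 cells, one for each of the 4096 possible words, so over-represented words appear as "hills" and under-represented words as "valleys".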

In general, one can observe many oligomer repeats in the "hills" of the landscape, especially
if one includes noncoding DNA regions. Their number is closely related to the mutation pressure
and selection. The statistical properties of short oligonucleotides have been discussed recently
by Buldyrev et al. In particular, they showed that the length distribution of dimeric tandem
repeats in coding DNA sequences is exponential, whereas in noncoding sequences it is more often
described by a power law.

In the following we restrict ourselves to the statistical analysis of 6-tuplets only. To this end
we consider the relative frequencies of 6-words, defined by a simple generalization of Eq. (1):

$$z = \frac{P_{ijklmn}}{P_i P_j P_k P_l P_m P_n}, \qquad (2)$$

where $P_{ijklmn}$ is the frequency of the word $ijklmn$ in the genome under consideration, and
$P_i, P_j, \ldots, P_n$ are the respective nucleotide frequencies ($i, j, k, l, m, n = A, T, G, C$).
Thus, $z = 1$ means that for a chosen 6-word the frequency of its usage in the genome is the same
as the expected probability calculated from the nucleotide occurrence. Both under-representation
and over-representation of 6-words might introduce short-range correlation effects. If the words
have a biological sense in DNA texts, they will be correlated at least in the region of a gene.
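The effective frequency z of an n-word can be computed directly from its definition. The sketch below assumes non-overlapping words and estimates the nucleotide frequencies from the same sequence; the function name `effective_frequencies` is illustrative.

```python
from collections import Counter
from math import prod

def effective_frequencies(seq, n=6):
    """Return {word: z}, where z is the observed word frequency divided by
    the product of the frequencies of its n nucleotides."""
    words = [seq[i:i + n] for i in range(0, len(seq) - n + 1, n)]
    word_counts = Counter(words)
    base_counts = Counter(seq)
    z = {}
    for word, count in word_counts.items():
        observed = count / len(words)
        expected = prod(base_counts[b] / len(seq) for b in word)
        z[word] = observed / expected
    return z

# Dinucleotide toy case (n = 2): "AC" is the only word, P_AC = 1,
# P_A = P_C = 0.5, so z = 1 / (0.5 * 0.5) = 4.
z_toy = effective_frequencies("ACAC", n=2)
```

With n = 2 the same function gives the dinucleotide effective frequencies of Karlin and Burge, and with the default n = 6 it gives the 6-word values used in the following section.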

2 DNA words versus mutations and selection 

The choice of the variables z in Eq. (2) to represent effective frequencies of 6-words, instead of
the absolute frequencies $P_{ijklmn}$, guarantees that trivial correlations, the artefacts coming
from the nucleotide bias, have been removed. Thus, if the numbers z associated with the respective
6-words represent biased random values only, their Zipf plot should be horizontal. We refer here
to the paper by Vandewalle and Ausloos, who used this argument in their analysis of financial
data, the daily fluctuations of the Apple stock price.

We analyze separately three gene subsequences, obtained by splicing nucleotides from position (1)
in codons, position (2) in codons and position (3) in codons. Next, the three resulting nucleotide
sequences are partitioned into non-overlapping 6-tuplets. Note that some 6-tuplets can be strongly
under-represented. The reason is that 6-tuplets are already gene-specific, and in the extreme case
it can happen that a 6-tuplet from a gene under consideration does not appear in any other gene.
This could introduce a strong correlation effect. Therefore, the values of z in Eq. (2) for a gene
under consideration are calculated with the help of the frequencies
$P_{ijklmn}, P_i, P_j, \ldots, P_n$ in a bank of 6-words representing all genes except the
considered one. In Fig. 2 we present the Zipf plots for 6-tuplets in the case of 2772 yeast genes
taken from ftp://genome-ftp.stanford.edu/pub/yeast/genome_seq/all.gcg. The results suggest that we
can expect non-trivial correlations between successive 6-words. The reason for the observed
step-like structure in Fig. 2 is that some deviations of $P_{ijklmn}$ from the expected value are
more frequent than others and, in general, the probability of choosing the next word may depend on
several of the preceding words.
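The preprocessing described in this paragraph can be sketched as follows: each gene is spliced into three subsequences by codon position, each subsequence is cut into non-overlapping 6-tuplets, and the word bank for a given gene is built from all other genes. The function names and the handling of incomplete trailing codons are assumptions made for illustration.

```python
from collections import Counter

def split_by_codon_position(gene):
    """Return the three subsequences made of codon positions 1, 2 and 3.
    Trailing bases that do not form a complete codon are ignored."""
    usable = len(gene) - len(gene) % 3
    return [gene[pos:usable:3] for pos in range(3)]

def tuplets(subseq, n=6):
    """Partition a subsequence into non-overlapping n-tuplets."""
    return [subseq[i:i + n] for i in range(0, len(subseq) - n + 1, n)]

def leave_one_out_banks(genes, position, n=6):
    """For each gene, count the n-words of all OTHER genes at the given
    codon position (0, 1 or 2)."""
    per_gene = [Counter(tuplets(split_by_codon_position(g)[position], n))
                for g in genes]
    total = Counter()
    for counts in per_gene:
        total.update(counts)
    banks = []
    for counts in per_gene:
        others = total.copy()
        others.subtract(counts)
        banks.append(+others)  # unary plus drops words with zero count
    return banks

parts = split_by_codon_position("ATGGCCTAA")
banks = leave_one_out_banks(["AAAAAA", "AAACCC", "CCCCCC"], position=0, n=2)
```

Subtracting each gene's own counts from the pooled counts avoids rebuilding the bank from scratch for every one of the several thousand genes.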

Note that the representation of genes by the effective frequencies (Eq. (2)) of their 6-tuplets
loses some information concerning base arrangement. It is often the case that different 6-tuplets
have exactly the same deviation of $P_{ijklmn}$ from the expected value in the genome. Thus a
question could arise: is the z representation of genes consistent with the Levy walk analog of a
two-dimensional DNA walk in the space (A-T, G-C), discussed by Abramson, Alemany and Cerdeira? In
that work it has been shown that the mean square displacement of the DNA walker follows the power
law $\langle r^2(s) \rangle \sim s^{\alpha}$, where $s$ denotes the number of steps and
$\alpha \approx 1.5$ for yeast chromosomes. Since $1 < \alpha < 2$, this walk corresponds to a
Levy walk. We could expect that the distribution of the effective frequencies of 6-words should
keep the memory of the Levy flights performed in the space (A-T, G-C). To show this, it will be
convenient for us to introduce a new variable $z'$, which is the effective frequency z defined in
Eq. (2) shifted by 1:

z' = z-1. (3) 
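The power law for the mean square displacement of the two-dimensional DNA walk, quoted above, can be illustrated with a minimal sketch. It assumes that A and T move the walker by +1 and -1 along one axis and G and C by +1 and -1 along the other, and that the displacement is averaged over all windows of s steps; the exact conventions of Abramson, Alemany and Cerdeira may differ, and the function names are hypothetical.

```python
import math

# Assumed step rule for the walk in the (A-T, G-C) plane
STEPS = {"A": (1, 0), "T": (-1, 0), "G": (0, 1), "C": (0, -1)}

def dna_walk(seq):
    """Positions of the walker, starting at the origin."""
    x = y = 0
    path = [(0, 0)]
    for base in seq:
        dx, dy = STEPS[base]
        x += dx
        y += dy
        path.append((x, y))
    return path

def mean_square_displacement(path, s):
    """Average squared displacement over all windows of s steps."""
    values = [(path[t + s][0] - path[t][0]) ** 2
              + (path[t + s][1] - path[t][1]) ** 2
              for t in range(len(path) - s)]
    return sum(values) / len(values)

def msd_exponent(path, lags):
    """Slope of log <r^2(s)> versus log s (least squares)."""
    xs = [math.log(s) for s in lags]
    ys = [math.log(mean_square_displacement(path, s)) for s in lags]
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    return (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
            / sum((x - mx) ** 2 for x in xs))

# A homopolymer is a purely ballistic walk, so the exponent is 2.
ballistic = msd_exponent(dna_walk("A" * 200), [1, 2, 4, 8, 16])
```

An uncorrelated random sequence would give an exponent close to 1, so a value between 1 and 2 signals the intermediate, Levy-walk-like behaviour.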

In Fig. 3 we plot the distribution of the numbers $z'$ representing yeast genes from one DNA
strand only. We obtained almost the same distribution for the numbers $z'$ calculated for genes
located on the complementary strand. In Fig. 4 we symmetrized the distribution of the numbers
$z'$ by introducing the values $-z'$ in the case of 6-tuplets of the complementary DNA strand. The
long-tail feature of the distribution of the numbers $z'$ is compared in the figure with a Levy
flight distribution calculated for the exponent $\alpha = 1.5$, a value characteristic of the
yeast genome. The large variance of z is consistent with the non-trivial correlation suggested by
Fig. 2. The results are also consistent with the data presented in Fig. 5, where for each gene
length the maximum value of $z'$ has been plotted. In the figure we can observe a trend: long
genes use more strongly over-represented 6-words than short genes. We have noticed an analogous
situation in the case of under-represented 6-words. If selection on the extremely
over-represented/under-represented 6-words were strong, long genes would be more affected by
spontaneous mutations than short genes. This is also suggested by the results of our Monte Carlo
simulations of gene evolution under constant mutation pressure and selection, where we showed
that short genes accumulate more mutations per gene length than long ones. The fact that
spontaneous mutation rates per nucleotide are inversely correlated with genome size was first
discussed by Drake et al. and later by Karlin and Burge. Our results might relate this phenomenon
to the strong over-representation and under-representation of some n-words representing
oligonucleotides.
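A symmetric Levy distribution such as the reference curve with exponent 1.5 can be evaluated numerically from its Fourier representation, L(x) = (1/pi) * integral from 0 to infinity of exp(-gamma * q^alpha) * cos(q x) dq. The sketch below uses simple trapezoidal integration; the scale parameter `gamma`, the cut-off `q_max` and the number of steps are assumptions, since the parameters used for the figure are not given in the text.

```python
import math

def levy_stable_pdf(x, alpha=1.5, gamma=1.0, q_max=50.0, steps=50000):
    """Symmetric Levy stable density at x, by trapezoidal integration of
    (1/pi) * exp(-gamma * q**alpha) * cos(q * x) over 0 <= q <= q_max."""
    h = q_max / steps
    # End points of the trapezoidal rule (the integrand equals 1 at q = 0)
    total = 0.5 * (1.0 + math.exp(-gamma * q_max ** alpha) * math.cos(q_max * x))
    for k in range(1, steps):
        q = k * h
        total += math.exp(-gamma * q ** alpha) * math.cos(q * x)
    return total * h / math.pi

# alpha = 1 reproduces the Cauchy distribution and alpha = 2 a Gaussian;
# alpha = 1.5 lies between them and has a power-law tail.
tail_levy = levy_stable_pdf(6.0, alpha=1.5)
tail_gauss = levy_stable_pdf(6.0, alpha=2.0)
```

The tail of the density for exponent 1.5 is much heavier than the Gaussian tail, which is the long-tail property that the distribution of z' is compared against.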
Figure Captions



Fig. 1 The chaos game representation of the yeast genes in the case of 6-words constructed from
the first base position in codons (left), second base position in codons (middle), third base
position in codons (right).

Fig. 2 Zipf plots for three DNA subsequences originating from 2772 yeast genes, separately for
bases in position (1) in codons, position (2) in codons, and position (3) in codons in the
case when the effective frequencies z of 6-words were used.

Fig. 3 Distribution of the numbers z' representing 6-words specific for base position (2) in codons
of the yeast genes. Here only the genes located on one DNA strand have been considered.

Fig. 4 Distribution of the numbers z' representing 6-words of yeast genes at base position (2) in
codons. We associated a value z' with genes of one strand, and a value -z' with genes of
the complementary DNA strand. The continuous line represents the Levy distribution with
$\alpha = 1.5$.

Fig. 5 For each gene length (in nucleotides), the maximum value of z has been recorded. 2722
yeast genes have been analyzed.